{r, print_organoids_absolute_copynumber} # head(organoids_absolute_copynumber) #This code seeks to explain how do organoids from patients reflect the generality of OV patients. We do clustering of patients and organoids and see where organoids fall – based on signatures, and also based on raw CN profiles.
The previous data are:
(total of 659 samples)
BriTROC: there are the original BriTROC segments (from NatGen paper) and new BriTROC segments (called BriTROC 2, here, but it’s not the BriTROC-2 cohort, it’s still the BriTROC-1 dataset, made by Ruben).
## Warning: package 'BiocGenerics' was built under R version 4.0.5
## Warning: package 'GenomeInfoDb' was built under R version 4.0.5
org<- as(readRDS("data/organoid_exposures.rds"), 'matrix')
# rownames(org) <- paste0('Sample ', 1:nrow(org))
# names_orgs = readxl::read_xlsx("data/NewOrganoidNaming.xlsx")
names_orgs = read_csv("data/NewOrganoidNaming.csv")
## Rows: 18 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): new name
## dbl (1): old name
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names_orgs$`new name`[match(rownames(org), paste0(names_orgs$`old name`, 'org'))]
## [1] "PDO16" "PDO15" "PDO3" "PDO9" "PDO5" "PDO6" "PDO10" "PDO1" "PDO12"
## [10] "PDO4" "PDO2" "PDO18" "PDO7" "PDO17" "PDO8" "PDO13" "PDO14" "PDO11"
rownames(org) = names_orgs$`new name`[match(rownames(org), paste0(names_orgs$`old name`, 'org'))]
rename_rows = function(i, new_names){
rownames(i) = new_names; return(i)}
## Creating plot... it might take some time if the data are large. Number of samples: 18
## Creating plot... it might take some time if the data are large. Number of samples: 18
Apparently only PDO1 is a WGD sample. I see which log-ratio with any other signature shows a clear difference between this WGD sample and the rest. The ratio between s4 (WGD signature) and s2 is.
## Var1 Var2 value group
## 1 PDO1 1 0.2064604 WGD
## 2 PDO1 2 13.6860836 WGD
## 3 PDO1 3 3.9450050 WGD
## 4 PDO1 4 1.0000000 WGD
## 5 PDO1 5 Inf WGD
## 6 PDO1 6 3.3746386 WGD
## 7 PDO1 7 Inf WGD
## Warning: Removed 11 rows containing missing values (geom_point).
apply(org, 2, function(j) table(factor(j==0, levels=c(T,F))))
## s1 s2 s3 s4 s5 s6 s7
## TRUE 1 0 0 6 4 4 2
## FALSE 17 18 18 12 14 14 16
table(apply(org[,-5], 1, function(j) paste0(j==0,collapse='-')))
##
## FALSE-FALSE-FALSE-FALSE-FALSE-FALSE FALSE-FALSE-FALSE-FALSE-FALSE-TRUE
## 9 2
## FALSE-FALSE-FALSE-TRUE-FALSE-FALSE FALSE-FALSE-FALSE-TRUE-TRUE-FALSE
## 3 3
## TRUE-FALSE-FALSE-FALSE-TRUE-FALSE
## 1
We are loading both the original signatures, and the updated signatures.
We have two dataframes: with the previous TCGA samples and with the current ones. Both contain the BriTROC and ICGC to this as well (which are shared).
## The percentage of zeros in each cohort is:
## $organoids
## $organoids[[1]]
## [1] "13.492%"
##
##
## $ExposuresNatGen
## $ExposuresNatGen$britroc
## [1] "0%"
##
## $ExposuresNatGen$`OV-AU`
## [1] "0%"
##
## $ExposuresNatGen$`OV-US`
## [1] "0%"
##
## $ExposuresNatGen$TCGA
## [1] "0%"
##
##
## $UpdatedExposures
## $UpdatedExposures$britroc
## [1] "0%"
##
## $UpdatedExposures$`OV-AU`
## [1] "0%"
##
## $UpdatedExposures$`OV-US`
## [1] "0%"
##
## $UpdatedExposures$`Updated TCGA`
## [1] "24.305%"
This makes the organoids and the TCGA exposures sample, and leaves the other in the periphery of the PCA. I suspect this is due to the number of zero exposures, which are imputated using the robust analyses that I am using here:
We are only selecting the updated exposures, now
which_natgens = c('UpdatedExposures')
which_natgen = 'UpdatedExposures'
For compositional data, in the book Analysing compositional data with R they say that PCA should be done on clr-transformed data. Zeroes are an issue if we use clr using all samples. The robust clr is implemented in the package compositions and deals with this problem by doing the geometric mean over only non-zero values, and setting the clr of a part which is zero to zero.
The plot done with (biplot(princomp(acomp(x)))) is the same as plotting princomp(as(clr(x), ‘matrix’))
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Saving 7 x 5 in image
## Saving 7 x 5 in image
PCA of a uniform sample from the 3-dimensional simplex (i.e. four parts), with three different transformations
PCA of organoids with three variations
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Plot the loadings of the organoids (PCA only with organoids)
grouping_cohort_all_basic = c(natgen_metadata[[which_natgen]]$study, rep('organoids', nrow(org)))
grouping_cohort_all_basic[grouping_cohort_all_basic != "organoids"] = "primary"
With imputation of 0.01
With imputation of 0.01
s1 seems not to be of importance in the loadings of this last PCA. Use it as baseline for PCA (good, too, because it’s almost always strictly positive).
## Creating plot... it might take some time if the data are large. Number of samples: 18
PCA with imputation of 0.01, as in the dendrogram
PCA of ALR with imputation of 0.01, as in the dendrogram
## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
org_rename = org
rownames(org_rename) = gsub("Sample ", "PDO", rownames(org))
grid.arrange(createBarplot(cbind(org_rename[,c(1,3,5,7)], sum=rowSums(org_rename[,c(2,4,6)])), remove_labels = FALSE, order_labels = names(sort(org_rename[,1]))) +
labs(y='Exposure')+ ggtitle('Exposures for the organoids')+theme(axis.text.x = element_text(angle = 45))+theme(legend.position = "bottom"),
createBarplot(normalise_rw(cbind(org_rename[,c(1,3,7)], sum=rowSums(org_rename[,c(2,4,6)]))), remove_labels = FALSE, order_labels = names(sort(org_rename[,1]))) +
labs(y='Exposure')+ ggtitle('Exposures for the organoids')+theme(axis.text.x = element_text(angle = 45))+theme(legend.position = "bottom"), nrow=1)
## Creating plot... it might take some time if the data are large. Number of samples: 18
## Creating plot... it might take some time if the data are large. Number of samples: 18
grid.arrange(give_pca(normalise_rw(cbind(org_rename[,c(1,3,7)], sum=rowSums(org_rename[,c(2,4,6)]))), names=rownames(org_rename)),
give_pca(as(compositions::clr(normalise_rw(cbind(org_rename[,c(1,3,7)], sum=rowSums(org_rename[,c(2,4,6)])))), 'matrix'), names=rownames(org_rename)),
give_pca(as(compositions::clr(normalise_rw(cbind(org_rename[,c(1,3,7)], sum=rowSums(org_rename[,c(2,4,6)])))), 'matrix'), names=rownames(org_rename),
give_loadings = T), nrow=1)
Note: s3 and s4 are consistently shown in the same PC axis, in opposite directions.
df_correlations_exposures_cohorts = data.frame(ALR_bsS1_s2=all_natgen[[which_natgen]][,2][1:nrow(natgen_metadata[[which_natgen]])]/all_natgen[[which_natgen]][,1][1:nrow(natgen_metadata[[which_natgen]])],
ALR_bsS1_s4=all_natgen[[which_natgen]][,4][1:nrow(natgen_metadata[[which_natgen]])]/all_natgen[[which_natgen]][,1][1:nrow(natgen_metadata[[which_natgen]])],
ALR_bsS1_s6=all_natgen[[which_natgen]][,6][1:nrow(natgen_metadata[[which_natgen]])]/all_natgen[[which_natgen]][,1][1:nrow(natgen_metadata[[which_natgen]])],
colour=natgen_metadata[[which_natgen]]$study[1:nrow(natgen_metadata[[which_natgen]])]
)
grid.arrange(ggplot(df_correlations_exposures_cohorts, aes(x=ALR_bsS1_s2,y=ALR_bsS1_s4))+
geom_point()+geom_smooth(method = "lm")+labs(x='ALR_s1(s2)', y='ALR_s1(s4)'),
ggplot(df_correlations_exposures_cohorts, aes(x=ALR_bsS1_s2,y=ALR_bsS1_s6))+
geom_point()+geom_smooth(method = "lm")+labs(x='ALR_s1(s2)', y='ALR_s1(s6)'), nrow=1)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 33 rows containing non-finite values (stat_smooth).
## Warning: Removed 7 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 33 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).
Pretty interesting plot, as it seems to show four driving forces whicc are orthogonal two-on-two. n1 (s1) and s2 (s3) are anticorrelated and driving most of the variance in organoids. The sum of s2,4,6, is orthogonal to both s1 and s3, and causes most subsequent variation.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: position_stack requires non-overlapping x intervals
Does the same happen in the cohorts?
give_pca(data_matrix = as(compositions::clr(normalise_rw(cbind(natgen[[which_natgen]][,c(1,3,7)],
sum=rowSums(natgen[[which_natgen]][,c(2,4,6)])))), 'matrix'),
names=rownames(natgen[[which_natgen]]), give_loadings = T, print_labels = T, print_both_labels=F)
give_pca(data_matrix = as(compositions::clr(impute(normalise_rw(cbind(natgen[[which_natgen]][,c(1,3,7)],
sum=rowSums(natgen[[which_natgen]][,c(2,4,6)]))), 1e-2)), 'matrix'),
names=rownames(natgen[[which_natgen]]), give_loadings = T, print_labels = T, print_both_labels=F)
par(mfrow=c(1,2))
dendroraw = give_dendrogram_generalised(all_natgen[[which_natgen]], modify_labels = F, keep_only_PDO = T)
dendroclr = give_dendrogram_generalised(as(compositions::clr(all_natgen[[which_natgen]]), 'matrix'), modify_labels = F, keep_only_PDO=T)
par(mfrow=c(1,4), mar=c(0.2,0.2,0.2,0.2))
dendroraw_pdo = give_dendrogram_generalised(all_natgen[[which_natgen]][grepl('PDO', rownames(all_natgen[[which_natgen]])),], modify_labels=F, keep_only_PDO = T)
dendroclr_pdo = give_dendrogram_generalised(as(compositions::clr(all_natgen[[which_natgen]][grepl('PDO', rownames(all_natgen[[which_natgen]])),]), 'matrix'), modify_labels=F, keep_only_PDO = T)
dendroimput_pdo = give_dendrogram_generalised(impute(all_natgen[[which_natgen]][grepl('PDO', rownames(all_natgen[[which_natgen]])),], 1e-2), modify_labels=F, keep_only_PDO = T)
dendroimputclr_pdo = give_dendrogram_generalised(as(compositions::clr(impute(all_natgen[[which_natgen]][grepl('PDO', rownames(all_natgen[[which_natgen]])),], 1e-2)), 'matrix'), modify_labels=F, keep_only_PDO = T)
dendroimputclr_all = give_dendrogram_generalised(as(compositions::clr(impute(all_natgen[[which_natgen]], 1e-2)), 'matrix'), modify_labels=F, keep_only_PDO = F)
Number of samples for this clustering analysis:
dim(all_natgen[[which_natgen]])
## [1] 710 7
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## quartz_off_screen
## 2
Now doing the same with a different imputation value
cluster_clades <- cutree(dendroimputclr_all, k=2)[match(rownames(all_natgen[[which_natgen]]),
names(cutree(dendroimputclr_all, k=2)))]
saveRDS(cluster_clades, "robjects/cluster_clades_imput002clr.RDS")
give_pca(as(compositions::clr(impute(all_natgen[[which_natgen]], 1e-2)), 'matrix'), title='imputation+CLR, centered',
names=c(ifelse(grepl('PDO', rownames(all_natgen[[which_natgen]])),
yes=rownames(all_natgen[[which_natgen]]),
no=NA)),
give_loadings = T,
print_labels=T,
groups=factor(cluster_clades),
group_is_factor=F, groups_shape=NULL,
nrow_legend=1)+
theme_bw()
## Warning: Removed 692 rows containing missing values (geom_label_repel).
give_pca(as(compositions::clr(impute(all_natgen[[which_natgen]], 1e-3)), 'matrix'), title='imputation+CLR, centered',
names=c(ifelse(grepl('PDO', rownames(all_natgen[[which_natgen]])),
yes=rownames(all_natgen[[which_natgen]]),
no=NA)),
give_loadings = T,
print_labels=T,
groups=factor(cutree(dendroimputclr_all, k=2)[match(rownames(all_natgen[[which_natgen]]),
names(cutree(dendroimputclr_all, k=2)))]),
group_is_factor=F, groups_shape=NULL,
nrow_legend=1)+
theme_bw()
## Warning: Removed 692 rows containing missing values (geom_label_repel).
pairs( (org/org[,1])[,-1], main='ALR with S1')
pairs( (org/org[,2])[,-2], main='ALR with S2')
grid.arrange(ggplot(cbind.data.frame(s2=org[,'s2'], s4=org[,'s4'], lab=rownames(org)), aes(x=s2, y=s4, label=lab))+geom_point()+geom_label_repel(),
ggplot(cbind.data.frame(s2s1=org[,'s2']/org[,'s1'], s4s1=org[,'s4']/org[,'s1'], lab=rownames(org)), aes(x=s2s1, y=s4s1, label=lab))+geom_point()+geom_label_repel(), nrow=1)
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
pairs( (all_natgen[[which_natgen]]/all_natgen[[which_natgen]][,1])[,-1], main='exp ALR with s1 as baseline')
pairs( (all_natgen[[which_natgen]]/all_natgen[[which_natgen]][,2])[,-2], main='exp ALR with s2 as baseline')
pairs( (impute(all_natgen[[which_natgen]], 1e-2)/impute(all_natgen[[which_natgen]], 1e-2)[,2])[,-2], main='ALR with s2 as baseline with imputation 1e-2')
alr_s2_natgen_imput002 <- exp(as(compositions::alr(impute(all_natgen[[which_natgen]], 1e-2), ivar = 2), 'matrix'))
alr_s2_natgen <- exp(as(compositions::alr(all_natgen[[which_natgen]], ivar = 2), 'matrix'))
alr_s1_natgen <- exp(as(compositions::alr(all_natgen[[which_natgen]], ivar = 1), 'matrix'))
pairs( alr_s2_natgen_imput002,
main='exp ALR with s2 as baseline with imputation 1e-2 (compositions package)')
head(cbind.data.frame(alr_s1_natgen, cluster_clades))
## s2 s3 s4 s5 s6 s7
## TCGA-04-1331 0.00000000 0.0000000 0.4675971 0.08074177 0.0000000 0.3958399
## TCGA-04-1332 0.12787207 0.0000000 0.3178997 0.24563898 0.2006571 0.7096114
## TCGA-04-1335 0.00000000 0.1892053 0.0000000 0.01740368 0.0000000 0.1488342
## TCGA-04-1336 0.00000000 0.0000000 0.5955696 0.00000000 0.0000000 0.2405088
## TCGA-04-1337 0.07024971 0.0000000 0.1129509 0.00000000 0.0000000 0.0000000
## TCGA-04-1338 8.22150432 2.4660602 2.6013353 2.04572519 2.3362734 28.7330851
## cluster_clades
## TCGA-04-1331 1
## TCGA-04-1332 1
## TCGA-04-1335 2
## TCGA-04-1336 1
## TCGA-04-1337 2
## TCGA-04-1338 1
ggplot(data.frame(alr_s2_natgen), aes(x=s3, y=s4))+geom_point()+theme_bw()
## Warning: Removed 180 rows containing missing values (geom_point).
head(data.frame(alr_s1_natgen))
## s2 s3 s4 s5 s6 s7
## TCGA-04-1331 0.00000000 0.0000000 0.4675971 0.08074177 0.0000000 0.3958399
## TCGA-04-1332 0.12787207 0.0000000 0.3178997 0.24563898 0.2006571 0.7096114
## TCGA-04-1335 0.00000000 0.1892053 0.0000000 0.01740368 0.0000000 0.1488342
## TCGA-04-1336 0.00000000 0.0000000 0.5955696 0.00000000 0.0000000 0.2405088
## TCGA-04-1337 0.07024971 0.0000000 0.1129509 0.00000000 0.0000000 0.0000000
## TCGA-04-1338 8.22150432 2.4660602 2.6013353 2.04572519 2.3362734 28.7330851
ggplot(cbind.data.frame(alr_s1_natgen, cluster_clades),
aes(x=s4, y=s7, col=cluster_clades))+
geom_point(alpha=0.2)+facet_wrap(.~cluster_clades)+theme_bw()+guides(col=F)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: Removed 1 rows containing missing values (geom_point).
ggplot(cbind.data.frame(alr_s2_natgen, cluster_clades),
aes(x=s1, y=s3, col=cluster_clades))+
geom_point(alpha=0.2)+facet_wrap(.~cluster_clades)+theme_bw()+guides(col=F)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: Removed 117 rows containing missing values (geom_point).
The fraction of zero exposures in the s3-rich clade is - lower in s1 (stat.signif.; p-value = 3.159e-07) - lower in s3 (stat.signif.; p-value < 2.2e-16) - higher in s4 (stat.signif.; p-value < 2.2e-16) - higher in s6 (stat.signif.; but marginally) - higher in s7 (stat.signif.; p-value = 1.409e-05)
zeros_in_clades <- lapply(split(f = cluster_clades,
x = cbind.data.frame(all_natgen[[which_natgen]])),
function(i) colSums(i==0))
zeros_in_clades
## $`1`
## s1 s2 s3 s4 s5 s6 s7
## 34 151 214 7 44 102 3
##
## $`2`
## s1 s2 s3 s4 s5 s6 s7
## 0 96 15 134 18 83 16
cluster_clades_size <- table(cluster_clades)
zeros_in_clades[[1]]/cluster_clades_size[1]
## s1 s2 s3 s4 s5 s6
## 0.074235808 0.329694323 0.467248908 0.015283843 0.096069869 0.222707424
## s7
## 0.006550218
zeros_in_clades[[2]]/cluster_clades_size[2]
## s1 s2 s3 s4 s5 s6 s7
## 0.00000000 0.38095238 0.05952381 0.53174603 0.07142857 0.32936508 0.06349206
lapply(1:7, function(i) fisher.test(matrix(c(zeros_in_clades[[1]][i],
cluster_clades_size[1]-zeros_in_clades[[1]][i],
zeros_in_clades[[2]][i],
cluster_clades_size[2]-zeros_in_clades[[2]][i]),
ncol=2 )))
## [[1]]
##
## Fisher's Exact Test for Count Data
##
## data: matrix(c(zeros_in_clades[[1]][i], cluster_clades_size[1] - zeros_in_clades[[1]][i], zeros_in_clades[[2]][i], cluster_clades_size[2] - zeros_in_clades[[2]][i]), ncol = 2)
## p-value = 3.159e-07
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 5.118443 Inf
## sample estimates:
## odds ratio
## Inf
##
##
## [[2]]
##
## Fisher's Exact Test for Count Data
##
## data: matrix(c(zeros_in_clades[[1]][i], cluster_clades_size[1] - zeros_in_clades[[1]][i], zeros_in_clades[[2]][i], cluster_clades_size[2] - zeros_in_clades[[2]][i]), ncol = 2)
## p-value = 0.1878
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.5733129 1.1165359
## sample estimates:
## odds ratio
## 0.7995343
##
##
## [[3]]
##
## Fisher's Exact Test for Count Data
##
## data: matrix(c(zeros_in_clades[[1]][i], cluster_clades_size[1] - zeros_in_clades[[1]][i], zeros_in_clades[[2]][i], cluster_clades_size[2] - zeros_in_clades[[2]][i]), ncol = 2)
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 7.891665 25.872815
## sample estimates:
## odds ratio
## 13.81518
##
##
## [[4]]
##
## Fisher's Exact Test for Count Data
##
## data: matrix(c(zeros_in_clades[[1]][i], cluster_clades_size[1] - zeros_in_clades[[1]][i], zeros_in_clades[[2]][i], cluster_clades_size[2] - zeros_in_clades[[2]][i]), ncol = 2)
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.005273055 0.030204872
## sample estimates:
## odds ratio
## 0.01376754
##
##
## [[5]]
##
## Fisher's Exact Test for Count Data
##
## data: matrix(c(zeros_in_clades[[1]][i], cluster_clades_size[1] - zeros_in_clades[[1]][i], zeros_in_clades[[2]][i], cluster_clades_size[2] - zeros_in_clades[[2]][i]), ncol = 2)
## p-value = 0.3308
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.7611102 2.6011438
## sample estimates:
## odds ratio
## 1.381039
##
##
## [[6]]
##
## Fisher's Exact Test for Count Data
##
## data: matrix(c(zeros_in_clades[[1]][i], cluster_clades_size[1] - zeros_in_clades[[1]][i], zeros_in_clades[[2]][i], cluster_clades_size[2] - zeros_in_clades[[2]][i]), ncol = 2)
## p-value = 0.002358
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.4085098 0.8351804
## sample estimates:
## odds ratio
## 0.5838318
##
##
## [[7]]
##
## Fisher's Exact Test for Count Data
##
## data: matrix(c(zeros_in_clades[[1]][i], cluster_clades_size[1] - zeros_in_clades[[1]][i], zeros_in_clades[[2]][i], cluster_clades_size[2] - zeros_in_clades[[2]][i]), ncol = 2)
## p-value = 1.409e-05
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.01805523 0.34568499
## sample estimates:
## odds ratio
## 0.09753283
pairs( (normalise_rw(all_natgen[[which_natgen]][,-5])/all_natgen[[which_natgen]][,1])[,-1], main='ALR with s1 as baseline, without s5')
sum(grepl('PDO', rownames(all_natgen[[which_natgen]])))
## [1] 18
sum(grepl('org', rownames(all_natgen[[which_natgen]])))
## [1] 0
which_natgen
## [1] "UpdatedExposures"
all_natgen_bool <- all_natgen[[which_natgen]] == 0
all_natgen_without_orgs <- all_natgen[[which_natgen]][!grepl('PDO', rownames(all_natgen[[which_natgen]])),]
all_natgen_without_orgs_imput_alr <- as(compositions::alr(normalise_rw(impute(all_natgen_without_orgs, 1e-2) ), ivar = 1), 'matrix')
orgs_imput_alr <- as(compositions::alr(normalise_rw(impute(org, 1e-2) ), ivar = 1), 'matrix')
orgs_bool <- org == 0
dim(all_natgen_without_orgs_imput_alr)
## [1] 692 6
all_natgen_bool_without_orgs <- all_natgen_bool[!grepl('PDO', rownames(all_natgen[[which_natgen]])),]
cooc_zeros <- outer(1:7, 1:7, Vectorize(function(i,j){
tab <- (all_natgen_bool_without_orgs[,i] == all_natgen_bool_without_orgs[,j])
sum(tab)/nrow(all_natgen_bool)
}))
cooc_zeros_org <- outer(1:7, 1:7, Vectorize(function(i,j){
tab <- (orgs_bool[,i] == orgs_bool[,j])
sum(tab)/nrow(orgs_bool)
}))
pearson_imput <- outer(1:7, 1:7, Vectorize(function(i,j){
# mean(all_natgen_without_orgs[,i]+1e-2) - mean(all_natgen_without_orgs[,j]+1e-2)
stats::cor(x = normalise_rw(all_natgen_without_orgs[,i]+1e-2 ),
y = normalise_rw(all_natgen_without_orgs[,j]+1e-2))
}))
pearson_imput_alrs1 <- outer(1:6, 1:6, Vectorize(function(i,j){
stats::cor(x = all_natgen_without_orgs_imput_alr[,i],
y = all_natgen_without_orgs_imput_alr[,j])
}))
pearson_imput_alrs1_orgs <- outer(1:6, 1:6, Vectorize(function(i,j){
stats::cor(x = orgs_imput_alr[,i],
y = orgs_imput_alr[,j])
}))
pearson_imput
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.00000000 -0.38969741 0.02715551 -0.363719493 -0.32854049 -0.2413427
## [2,] -0.38969741 1.00000000 0.05049182 -0.050973286 -0.02547167 0.1278399
## [3,] 0.02715551 0.05049182 1.00000000 -0.510074854 0.09819571 -0.2462266
## [4,] -0.36371949 -0.05097329 -0.51007485 1.000000000 -0.25268794 0.1469121
## [5,] -0.32854049 -0.02547167 0.09819571 -0.252687937 1.00000000 -0.1509271
## [6,] -0.24134265 0.12783991 -0.24622663 0.146912135 -0.15092710 1.0000000
## [7,] -0.47633404 -0.15723314 -0.24460908 -0.003010673 0.18689540 -0.2303376
## [,7]
## [1,] -0.476334044
## [2,] -0.157233143
## [3,] -0.244609078
## [4,] -0.003010673
## [5,] 0.186895401
## [6,] -0.230337551
## [7,] 1.000000000
rownames(cooc_zeros) <- colnames(cooc_zeros) <- rownames(cooc_zeros_org) <- colnames(cooc_zeros_org) <- paste0('s', 1:7)
rownames(pearson_imput) <- colnames(pearson_imput) <- paste0('s', 1:7)
rownames(pearson_imput_alrs1) <- colnames(pearson_imput_alrs1) <- rownames(pearson_imput_alrs1_orgs) <- colnames(pearson_imput_alrs1_orgs) <- paste0('ALR_s1(', 1:6, ')')
pheatmap(cooc_zeros, main = "Co-occurrence of zeros", cluster_r = F, cluster_c = F)
pheatmap(cooc_zeros_org, main = "Orgs Co-occurrence of zeros", cluster_r = F, cluster_c = F)
pheatmap(pearson_imput, main = "Pearson correlation of imputated activities (prob)", cluster_r = F, cluster_c = F)
pheatmap(pearson_imput_alrs1, main = "Pearson correlation of imputated activities (ALR s1)", cluster_r = F, cluster_c = F)
pheatmap(pearson_imput_alrs1_orgs, main = "Orgs Pearson correlation of imputated activities (ALR s1)", cluster_r = F, cluster_c = F)
dim(all_natgen_bool)
## [1] 710 7
dim(all_natgen_bool_without_orgs)
## [1] 692 7
df_correlations_exposures_organoids = data.frame(ALR_bsS1_s2=org[,2]/org[,1],
ALR_bsS1_s4=org[,4]/org[,1],
ALR_bsS1_s6=org[,6]/org[,1])
grid.arrange(ggplot(df_correlations_exposures_organoids, aes(x=ALR_bsS1_s2,y=ALR_bsS1_s4))+
geom_point()+geom_smooth(method = "lm")+labs(x='ALR_s1(s2)', y='ALR_s1(s4)'),
ggplot(df_correlations_exposures_organoids, aes(x=ALR_bsS1_s2,y=ALR_bsS1_s6))+
geom_point()+geom_smooth(method = "lm")+labs(x='ALR_s1(s2)', y='ALR_s1(s6)'), nrow=1)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
The imputation+clr dendrograms are not the same when considering only the organoids than when using all samples from TCGA, but in general the two agree in which organoids get clustered and the overal structure of the dendrogram.
# dendroimputclr_all = give_dendrogram_generalised(as(compositions::clr(impute(all_natgen[[which_natgen]], 1e-2)), 'matrix'), modify_labels=F, keep_only_PDO = F)
dendroimputclrimpute_org <- give_dendrogram_generalised(as(compositions::clr(impute(all_natgen[[which_natgen]][grepl('PDO', rownames(all_natgen[[which_natgen]])),], 1e-2)), 'matrix'), modify_labels=F, keep_only_PDO = T, plot_dendro=F)
hclust_rnaseq <- readRDS("../RNASeq_DE_resistant_sensitive/objects/hclust_rnaseq.RDS")
tanglegram(as.dendrogram (dendroimputclrimpute_org), as.dendrogram(dendroimputclr_all), main='PDO-only clustering vs all exposures', cex_main=1)
## Warning in intersect_trees(dend1, dend2, warn = TRUE): The labels in both tree
## had different values - trees were pruned.
tanglegram(as.dendrogram (dendroimputclrimpute_org), as.dendrogram(hclust_rnaseq), main='PDO-only clustering vs RNA-Seq clustering', cex_main=1)
## Warning in intersect_trees(dend1, dend2, warn = TRUE): The labels in both tree
## had different values - trees were pruned.
tanglegram(as.dendrogram (dendroimputclr_all), as.dendrogram(hclust_rnaseq), main='All exposures clustering vs RNA-Seq clustering', cex_main=1)
## Warning in intersect_trees(dend1, dend2, warn = TRUE): The labels in both tree
## had different values - trees were pruned.
Now with TPM
hclust_rnaseq_TPM <- readRDS("../RNASeq_DE_resistant_sensitive/objects/hclust_rnaseq_TPM.RDS")
tanglegram(as.dendrogram (dendroimputclrimpute_org), as.dendrogram(hclust_rnaseq_TPM), main='PDO-only clustering vs RNA-Seq TPM clustering', cex_main=1)
## Warning in intersect_trees(dend1, dend2, warn = TRUE): The labels in both tree
## had different values - trees were pruned.
tanglegram(as.dendrogram (dendroimputclr_all), as.dendrogram(hclust_rnaseq_TPM), main='All exposures clustering vs RNA-Seq TPM clustering', cex_main=1)
## Warning in intersect_trees(dend1, dend2, warn = TRUE): The labels in both tree
## had different values - trees were pruned.
tanglegram(as.dendrogram (hclust_rnaseq), as.dendrogram(hclust_rnaseq_TPM), main='RNA-Seq clustering vs RNA-Seq TPM clustering', cex_main=1)
additional genomic data comparing the tumours to the organoids in terms of ploidy, number of rearrangements and any other things that you think could be relevant
pcawg_CN_features = readRDS("data/pcawg_CN_features.rds")
tcga_CN_features = readRDS("data/tcga_CN_features.rds")
BriTROC_absolute_copynumber = readRDS("../../cnsignatures/manuscript_Rmarkdown/data/BriTROC_absolute_copynumber.rds")
BriTROC2_CN_features = readRDS("data/6_TCGA_Signatures_on_BRITROC/0_BRITROC_absolute_CN.rds")
organoids_absolute_copynumber = readRDS("data/organoid_absolute_CN.rds")
sampleNames(organoids_absolute_copynumber) = names_orgs$`new name`[match(gsub("org", "", sampleNames(organoids_absolute_copynumber)), names_orgs$`old name`)]
organoids_CN_features = extractCopynumberFeatures(organoids_absolute_copynumber)
BriTROC_CN_features = readRDS("data/BriTROC_CN_features.rds")
The number of segments can be taken either from segsize (first column) or from copynumber (last column). This is just for PCAWG and TCGA! Not for BriTROC. Any idea why this is the case?
** Note: I am plotting this as the log!**
** Note 2: I am pooling together segments from several samples and analysing them all together! I.e. I have not averaged over samples **
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## quartz_off_screen
## 2
distrib_segsize = create_distrib_df('segsize', 'value')
distrib_bp10MB = create_distrib_df('bp10MB', 'value')
distrib_osCN = create_distrib_df('osCN', 'value')
distrib_bpchrarm = create_distrib_df('bpchrarm', 'ct1')
distrib_changepoint = create_distrib_df('changepoint', 'value')
distrib_copynumber = create_distrib_df('copynumber', 'value')
#------------------------------------------------------------------------------------------------#
## 1/6 breakpoints per 10MB
grid.arrange(give_joint_histogram(list(distrib_bp10MB[['org']]$`mean(ct1_num)`,
distrib_bp10MB[['pcawg']]$`mean(ct1_num)`,
distrib_bp10MB[['tcga']]$`mean(ct1_num)`,
distrib_bp10MB[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE),
give_joint_histogram(list(distrib_bp10MB[['org']]$`mean(ct1_num)` %>% log,
distrib_bp10MB[['pcawg']]$`mean(ct1_num)` %>% log,
distrib_bp10MB[['tcga']]$`mean(ct1_num)` %>% log,
distrib_bp10MB[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)
t.test(distrib_bp10MB[['org']]$`mean(ct1_num)` %>% log,
c(distrib_bp10MB[['pcawg']]$`mean(ct1_num)` %>% log,
distrib_bp10MB[['tcga']]$`mean(ct1_num)` %>% log,
distrib_bp10MB[['BriTROC']]$`mean(ct1_num)` %>% log))
##
## Welch Two Sample t-test
##
## data: distrib_bp10MB[["org"]]$`mean(ct1_num)` %>% log and c(distrib_bp10MB[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_bp10MB[["tcga"]]$`mean(ct1_num)` %>% log, distrib_bp10MB[["BriTROC"]]$`mean(ct1_num)` %>% log)
## t = -0.52491, df = 18.513, p-value = 0.6059
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3415468 0.2047773
## sample estimates:
## mean of x mean of y
## -0.8415706 -0.7731859
#------------------------------------------------------------------------------------------------#
## 2/6 segment size
grid.arrange(give_joint_histogram(list(distrib_segsize[['org']]$`mean(ct1_num)`,
distrib_segsize[['pcawg']]$`mean(ct1_num)`,
distrib_segsize[['tcga']]$`mean(ct1_num)`,
distrib_segsize[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE)+ggtitle('Raw'),
give_joint_histogram(list(distrib_segsize[['org']]$`mean(ct1_num)` %>% log,
distrib_segsize[['pcawg']]$`mean(ct1_num)` %>% log,
distrib_segsize[['tcga']]$`mean(ct1_num)` %>% log,
distrib_segsize[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)
t.test(distrib_segsize[['org']]$`mean(ct1_num)` %>% log,
c(distrib_segsize[['pcawg']]$`mean(ct1_num)` %>% log,
distrib_segsize[['tcga']]$`mean(ct1_num)` %>% log,
distrib_segsize[['BriTROC']]$`mean(ct1_num)` %>% log))
##
## Welch Two Sample t-test
##
## data: distrib_segsize[["org"]]$`mean(ct1_num)` %>% log and c(distrib_segsize[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_segsize[["tcga"]]$`mean(ct1_num)` %>% log, distrib_segsize[["BriTROC"]]$`mean(ct1_num)` %>% log)
## t = 0.77176, df = 18.338, p-value = 0.4501
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1435866 0.3106786
## sample estimates:
## mean of x mean of y
## 16.69896 16.61541
#------------------------------------------------------------------------------------------------#
## 3/6 oscillating CN
grid.arrange(give_joint_histogram(list(distrib_osCN[['org']]$`mean(ct1_num)`,
distrib_osCN[['pcawg']]$`mean(ct1_num)`,
distrib_osCN[['tcga']]$`mean(ct1_num)`,
distrib_osCN[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE)+ggtitle('Raw'),
give_joint_histogram(list(distrib_osCN[['org']]$`mean(ct1_num)` %>% log,
distrib_osCN[['pcawg']]$`mean(ct1_num)` %>% log,
distrib_osCN[['tcga']]$`mean(ct1_num)` %>% log,
distrib_osCN[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)
## Warning: Removed 15 rows containing non-finite values (stat_bin).
t.test(distrib_osCN[['org']]$`mean(ct1_num)` %>% log,
remove_infty(c(distrib_osCN[['pcawg']]$`mean(ct1_num)` %>% log,
distrib_osCN[['tcga']]$`mean(ct1_num)` %>% log,
distrib_osCN[['BriTROC']]$`mean(ct1_num)` %>% log)))
##
## Welch Two Sample t-test
##
## data: distrib_osCN[["org"]]$`mean(ct1_num)` %>% log and remove_infty(c(distrib_osCN[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_osCN[["tcga"]]$`mean(ct1_num)` %>% log, distrib_osCN[["BriTROC"]]$`mean(ct1_num)` %>% log))
## t = 0.36511, df = 18.948, p-value = 0.7191
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1355887 0.1928764
## sample estimates:
## mean of x mean of y
## -0.5555398 -0.5841836
#------------------------------------------------------------------------------------------------#
## 4/6 num of breakpoints per chromosome arm
grid.arrange(give_joint_histogram(list(distrib_bpchrarm[['org']]$`mean(ct1_num)`,
distrib_bpchrarm[['pcawg']]$`mean(ct1_num)`,
distrib_bpchrarm[['tcga']]$`mean(ct1_num)`,
distrib_bpchrarm[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE)+ggtitle('Raw'),
give_joint_histogram(list(distrib_bpchrarm[['org']]$`mean(ct1_num)` %>% log,
distrib_bpchrarm[['pcawg']]$`mean(ct1_num)` %>% log,
distrib_bpchrarm[['tcga']]$`mean(ct1_num)` %>% log,
distrib_bpchrarm[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)
## Same distribution of num of breakpoints per chromosome arm
t.test(distrib_bpchrarm[['org']]$`mean(ct1_num)` %>% log,
c(distrib_bpchrarm[['pcawg']]$`mean(ct1_num)` %>% log,
distrib_bpchrarm[['tcga']]$`mean(ct1_num)` %>% log,
distrib_bpchrarm[['BriTROC']]$`mean(ct1_num)` %>% log))
##
## Welch Two Sample t-test
##
## data: distrib_bpchrarm[["org"]]$`mean(ct1_num)` %>% log and c(distrib_bpchrarm[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_bpchrarm[["tcga"]]$`mean(ct1_num)` %>% log, distrib_bpchrarm[["BriTROC"]]$`mean(ct1_num)` %>% log)
## t = -0.53247, df = 18.515, p-value = 0.6007
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3425379 0.2037960
## sample estimates:
## mean of x mean of y
## 1.078022 1.147393
#------------------------------------------------------------------------------------------------#
## 5/6 num of changepoints
grid.arrange(give_joint_histogram(list(distrib_changepoint[['org']]$`mean(ct1_num)`,
distrib_changepoint[['pcawg']]$`mean(ct1_num)`,
distrib_changepoint[['tcga']]$`mean(ct1_num)`,
distrib_changepoint[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE)+ggtitle('Raw'),
give_joint_histogram(list(distrib_changepoint[['org']]$`mean(ct1_num)` %>% log,
distrib_changepoint[['pcawg']]$`mean(ct1_num)` %>% log,
distrib_changepoint[['tcga']]$`mean(ct1_num)` %>% log,
distrib_changepoint[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)
## Same distribution of num of changepoints
t.test(distrib_changepoint[['org']]$`mean(ct1_num)` %>% log,
c(distrib_changepoint[['pcawg']]$`mean(ct1_num)` %>% log,
distrib_changepoint[['tcga']]$`mean(ct1_num)` %>% log,
distrib_changepoint[['BriTROC']]$`mean(ct1_num)` %>% log))
##
## Welch Two Sample t-test
##
## data: distrib_changepoint[["org"]]$`mean(ct1_num)` %>% log and c(distrib_changepoint[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_changepoint[["tcga"]]$`mean(ct1_num)` %>% log, distrib_changepoint[["BriTROC"]]$`mean(ct1_num)` %>% log)
## t = 0.19539, df = 18.554, p-value = 0.8472
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1691729 0.2039479
## sample estimates:
## mean of x mean of y
## 0.4200699 0.4026824
#------------------------------------------------------------------------------------------------#
## 6/6 copy number
grid.arrange(give_joint_histogram(list(distrib_copynumber[['org']]$`mean(ct1_num)`,
distrib_copynumber[['pcawg']]$`mean(ct1_num)`,
distrib_copynumber[['tcga']]$`mean(ct1_num)`,
distrib_copynumber[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE)+ggtitle('Raw'),
give_joint_histogram(list(distrib_copynumber[['org']]$`mean(ct1_num)` %>% log,
distrib_copynumber[['pcawg']]$`mean(ct1_num)` %>% log,
distrib_copynumber[['tcga']]$`mean(ct1_num)` %>% log,
distrib_copynumber[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)
## Same distribution of copy number of segments
t.test(distrib_copynumber[['org']]$`mean(ct1_num)` %>% log,
c(distrib_copynumber[['pcawg']]$`mean(ct1_num)` %>% log,
distrib_copynumber[['tcga']]$`mean(ct1_num)` %>% log,
distrib_copynumber[['BriTROC']]$`mean(ct1_num)` %>% log))
##
## Welch Two Sample t-test
##
## data: distrib_copynumber[["org"]]$`mean(ct1_num)` %>% log and c(distrib_copynumber[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_copynumber[["tcga"]]$`mean(ct1_num)` %>% log, distrib_copynumber[["BriTROC"]]$`mean(ct1_num)` %>% log)
## t = 0.35387, df = 18.18, p-value = 0.7275
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1286706 0.1808402
## sample estimates:
## mean of x mean of y
## 1.160435 1.134350
Explanation: The distribution of sample-averaged values of features are the sample between organoids and the cohorts of primary tissue samples for the number of breakpoints per 10MB (Welch Two Sample t-test on log-transformed data, p-value = 0.6059), segment size (Welch Two Sample t-test on log-transformed data, p-value = 0.4501 ), oscillating copy number (Welch Two Sample t-test on log-transformed data, p-value = 0.7191), number of breakpoints per chromosome arm (Welch Two Sample t-test on log-transformed data, p-value = 0.6007), number of changepoints (Welch Two Sample t-test on log-transformed data, p-value = 0.8472), and the copy number of the segments (Welch Two Sample t-test on log-transformed data, p-value = 0.7275).
TL;DR with a negative binomial model, which is much more appropriate in this setting than a Poisson, there is no difference in the distributions of organoids and non-organodis when it comes to number of segments.
## Analysis of Deviance Table
##
## Model 1: length ~ 1
## Model 2: length ~ bool
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 824 58116
## 2 823 58024 1 91.318 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning in anova.negbin(...): only Chi-squared LR tests are implemented
## Likelihood ratio tests of Negative Binomial Models
##
## Response: length
## Model theta Resid. df 2 x log-lik. Test df LR stat. Pr(Chi)
## 1 1 3.076226 824 -9971.807
## 2 bool 3.081258 823 -9970.344 1 vs 2 1 1.46259 0.2265185
sapply(list(glm_poisson_length[glm_poisson_length$names == 'organoids','length']), function(i) c(mean(i), sd(i)))
## [,1]
## [1,] 169.0556
## [2,] 77.4866
sapply(list(glm_poisson_length[glm_poisson_length$names != 'organoids','length']), function(i) c(mean(i), sd(i)))
## [,1]
## [1,] 200.4002
## [2,] 134.0146
Unfortunately the scaling factor has to do with the width of the bins in the histogram.
To get the ploidy, I just have to compute the weighted average of the copy number segments (this is computed from the absolute copy number profiles objects, since they specify, for each segment, its length and its ploidy).
Use getSegTable to get the segments from this Biobase file
Needs to be ammended: it has to also count areas of the genome for which we don’t have segments! i.e. we need to know the effective genome size
## we only want the ovarian ones
ICGC_absolute_copynumber_AU = readRDS("data/CN_Calls_ABSOLUTE_PCAWG/OV-AU.segments.raw.rds")
ICGC_absolute_copynumber_US = readRDS("data/CN_Calls_ABSOLUTE_PCAWG/OV-US.segments.raw.rds")
ICGC_absolute_copynumber_AU = ICGC_absolute_copynumber_AU[,c('sample', 'chr', 'startpos', 'endpos', 'segVal')]
ICGC_absolute_copynumber_US = ICGC_absolute_copynumber_US[,c('sample', 'chr', 'startpos', 'endpos', 'segVal')]
segtables_ICGC_absolute_copynumber_AU = lapply(sort(unique(ICGC_absolute_copynumber_AU$sample)),
function(samplename)
ICGC_absolute_copynumber_AU[ICGC_absolute_copynumber_AU$sample == samplename,])
segtables_ICGC_absolute_copynumber_AU = lapply(segtables_ICGC_absolute_copynumber_AU, function(i) { colnames(i)[colnames(i) == "chr"] = "chromosome";
colnames(i)[colnames(i) == "endpos"] = "end";
return(i) } )
names(segtables_ICGC_absolute_copynumber_AU) = unique(ICGC_absolute_copynumber_AU$sample)
segtables_ICGC_absolute_copynumber_US = lapply(sort(unique(ICGC_absolute_copynumber_US$sample)),
function(samplename) ICGC_absolute_copynumber_US[ICGC_absolute_copynumber_US$sample == samplename,])
segtables_ICGC_absolute_copynumber_US = lapply(segtables_ICGC_absolute_copynumber_US, function(i) { colnames(i)[colnames(i) == "chr"] = "chromosome";
colnames(i)[colnames(i) == "endpos"] = "end";
return(i) } )
names(segtables_ICGC_absolute_copynumber_US) = unique(ICGC_absolute_copynumber_US$sample)
## for ICGC, remove the samples row and put it in the rows
segtables_ICGC_absolute_copynumber_US = lapply(segtables_ICGC_absolute_copynumber_US, function(i){
rownames(i) = i$samples
i = i[,-1]
i})
segtables_ICGC_absolute_copynumber_AU = lapply(segtables_ICGC_absolute_copynumber_AU, function(i){
rownames(i) = i$samples
i = i[,-1]
i})
## Check that there are no sex chromosomes included anywhere
## [1] TRUE TRUE TRUE TRUE TRUE
system("mkdir -p data/clean_segtables/")
saveRDS(segtables_BriTROC_absolute_copynumber, "data/clean_segtables/segtables_BriTROC_absolute_copynumber.RDS")
saveRDS(segtables_ICGC_absolute_copynumber_US, "data/clean_segtables/segtables_ICGC_absolute_copynumber_US.RDS")
saveRDS(segtables_ICGC_absolute_copynumber_AU, "data/clean_segtables/segtables_ICGC_absolute_copynumber_AU.RDS")
saveRDS(segtables_TCGA_absolute_copynumber, "data/clean_segtables/segtables_TCGA_absolute_copynumber.RDS")
In the previous computation of the sample ploidy (ploidy_ICGC_US_previous, etc.) I had not included normal segments. Now I do but actually the result is very similar. (Note slghtly under- and over-estimated values at the right and left of ploidy=2).
plot(log(c(ploidy_ICGC_US_previous, ploidy_ICGC_AU_previous, ploidy_TCGA_previous,
ploidy_BriTROC_previous, ploidy_organoids_previous)),
log(c(ploidy_ICGC_US, ploidy_ICGC_AU, ploidy_TCGA, ploidy_BriTROC, ploidy_organoids)),
ylab='log ploidy of samples with current mean ploidy calculation',
xlab='log ploidy of samples with previous mean ploidy calculation')
abline(coef=c(0,1), lty='dashed')
abline(v=log(2), lty='dashed')
Ploidy is not normally distributed and it’s right-skewed. Moreover, the distribution is bimodal: I guess there are genomes in which there is a clear amplification and genomes which are more or less normal, so centered around 2.
I am also using a robust linear regression, but I don’t think this is suitable either.
t.test(log(ploidy_organoids), log(ploidy_BriTROC))
##
## Welch Two Sample t-test
##
## data: log(ploidy_organoids) and log(ploidy_BriTROC)
## t = -0.90962, df = 20.948, p-value = 0.3734
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.16268846 0.06368736
## sample estimates:
## mean of x mean of y
## 0.9401143 0.9896149
MASS::rlm(ploidy~group,
data=cbind.data.frame(ploidy=c(ploidy_organoids, ploidy_BriTROC), group=c(rep(1,length(ploidy_organoids)), rep(2, length(ploidy_BriTROC)))))
## Call:
## rlm(formula = ploidy ~ group, data = cbind.data.frame(ploidy = c(ploidy_organoids,
## ploidy_BriTROC), group = c(rep(1, length(ploidy_organoids)),
## rep(2, length(ploidy_BriTROC)))))
## Converged in 5 iterations
##
## Coefficients:
## (Intercept) group
## 2.4791559 0.1280964
##
## Degrees of freedom: 298 total; 296 residual
## Scale estimate: 0.675
## Segments across the genome
# (sapply(chrlen$V1, function(i) gsub("chr", "", i)))
sorted_chroms = chrlen$V1[order(as.numeric((gsub("chr", "", chrlen$V1))))]
## Warning in eval(quote(list(...)), env): NAs introduced by coercion
chrom_lenths = chrlen[match(sorted_chroms, chrlen$V1),]
## quartz_off_screen
## 2
## quartz_off_screen
## 2
{r, print_organoids_absolute_copynumber} # head(organoids_absolute_copynumber) #Amplifications/deletions of selected genes in organoids
## PDO16 PDO15 PDO3 PDO9 PDO5 PDO6 PDO10 PDO1
## 1.752252 2.706511 2.921865 2.825767 2.617572 2.639571 2.886007 3.659920
## PDO12 PDO4 PDO2 PDO18 PDO7 PDO17 PDO8 PDO13
## 1.865682 2.818707 2.595914 2.884280 1.719599 2.864533 1.715556 2.753563
## PDO14 PDO11
## 2.805061 3.042383
How come the segments include all the genome?
full_GR_example <- c(GRanges_chroms, as(rename_cols(data.frame(segtables_ICGC_absolute_copynumber_US[[1]])), "GRanges"))
granges_example <- GenomicRanges::disjoin(full_GR_example, with.revmap=TRUE, ignore.strand=TRUE)
diploid_segments = granges_example[sapply(granges_example$revmap, function(i) !any(i > length(GRanges_chroms))),]
diploid_segments
## GRanges object with 68 ranges and 1 metadata column:
## seqnames ranges strand | revmap
## <Rle> <IRanges> <Rle> | <IntegerList>
## [1] 1 8799389-8863518 * | 1
## [2] 1 121352501-128899999 * | 1
## [3] 1 247249601-249250621 * | 1
## [4] 2 1-20015 * | 2
## [5] 2 242985494-243199373 * | 2
## ... ... ... ... . ...
## [64] 20 62615607-63025520 * | 19
## [65] 21 1-14413101 * | 22
## [66] 21 46937501-48129895 * | 22
## [67] 22 1-12199999 * | 21
## [68] 22 49687501-51304566 * | 21
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
length(diploid_segments)
## [1] 68
chrlen
## V1 V2
## 1 chr1 249250621
## 2 chr2 243199373
## 3 chr3 198022430
## 4 chr4 191154276
## 5 chr5 180915260
## 6 chr6 171115067
## 7 chr7 159138663
## 8 chrX 155270560
## 9 chr8 146364022
## 10 chr9 141213431
## 11 chr10 135534747
## 12 chr11 135006516
## 13 chr12 133851895
## 14 chr13 115169878
## 15 chr14 107349540
## 16 chr15 102531392
## 17 chr16 90354753
## 18 chr17 81195210
## 19 chr18 78077248
## 20 chr20 63025520
## 21 chrY 59373566
## 22 chr19 59128983
## 23 chr22 51304566
## 24 chr21 48129895
dim(natgen[[which_natgen]])
## [1] 692 7
as.vector(number_of_segments$organoids[match(gsub('PDO', '', names(number_of_segments$organoids)), gsub('PDO ', '', rownames(org)))])
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# plot(as.vector(number_of_segments$organoids[match(gsub('PDO', '', names(number_of_segments$organoids)), gsub('PDO ', '', rownames(org)))]),
# org[,'s6'])
# plot(as.vector(number_of_segments$organoids[match(gsub('PDO', '', names(number_of_segments$organoids)), gsub('PDO ', '', rownames(org)))]),
# org[,'s6']/org[,'s1'])
sort(number_of_segments$organoids)
##
## PDO1 PDO11 PDO18 PDO16 PDO7 PDO12 PDO8 PDO13 PDO17 PDO2 PDO6 PDO5 PDO10
## 64 64 113 118 120 130 137 149 155 165 174 176 178
## PDO3 PDO9 PDO15 PDO14 PDO4
## 202 205 212 288 393
## MYC: chr8:128,747,680-128,755,197
lapply(names(segtables_organoids_absolute_copynumber), function(name_org){
give_plot_areas_of_interest(segtab=segtables_organoids_absolute_copynumber[[name_org]],
chrom_AOI=8, start_AOI=118747680,
end_AOI=118755197,
subclonal_line1=2, subclonal_line2=3, title=paste0('MYC ', name_org))
})
## [[1]]
## [[1]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,52 1.95097
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[1]][[2]]
##
##
## [[2]]
## [[2]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,104 8.89182
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[2]][[2]]
##
##
## [[3]]
## [[3]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,125 6.28908
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[3]][[2]]
##
##
## [[4]]
## [[4]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,129 6.38458
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[4]][[2]]
##
##
## [[5]]
## [[5]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,72 3.90838
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[5]][[2]]
##
##
## [[6]]
## [[6]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,75 4.13676
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[6]][[2]]
##
##
## [[7]]
## [[7]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,102 4.31683
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[7]][[2]]
##
##
## [[8]]
## [[8]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,32 9.61704
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[8]][[2]]
##
##
## [[9]]
## [[9]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,69 2.02661
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[9]][[2]]
##
##
## [[10]]
## [[10]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,228 3.05402
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[10]][[2]]
##
##
## [[11]]
## [[11]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,81 2.05615
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[11]][[2]]
##
##
## [[12]]
## [[12]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,56 3.06459
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[12]][[2]]
##
##
## [[13]]
## [[13]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,39 2.65679
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[13]][[2]]
##
##
## [[14]]
## [[14]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,82 3.95907
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[14]][[2]]
##
##
## [[15]]
## [[15]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,49 2.97383
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[15]][[2]]
##
##
## [[16]]
## [[16]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,76 3.9072
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[16]][[2]]
##
##
## [[17]]
## [[17]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,148 4.97256
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[17]][[2]]
##
##
## [[18]]
## [[18]][[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | revmap CNval
## <Rle> <IRanges> <Rle> | <IntegerList> <numeric>
## [1] 8 118747680-118755197 * | 1,28 5.91658
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[18]][[2]]
sapply(plts_AOI_subclonal, `[`, 1)